Sequence Labeling

Core Concept

Sequence labeling is the task of assigning a label to each element of an input sequence so that the output is a sequence of labels of the same length. The input is typically a sequence of tokens (e.g. words, characters, or time steps) and the output is a corresponding sequence of tags (e.g. part-of-speech, named-entity type, chunk type). Dependencies between labels are often important—adjacent or nearby tags tend to follow valid patterns (e.g. IOB constraints in NER)—so models usually capture local or long-range structure rather than predicting each position independently. This makes sequence labeling a core structured-prediction problem with applications in NLP, speech, and time-series analysis.
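
For concreteness, a minimal example of the input/output pairing (the sentence, tag set, and BIO labels are purely illustrative):

    # One input token sequence, one output tag sequence of equal length.
    # Tags use the BIO scheme: B- begins an entity span, I- continues it, O is outside.
    tokens = ["Barack", "Obama", "visited", "Paris", "in", "2009", "."]
    tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "O",  "O",    "O"]

    assert len(tokens) == len(tags)  # sequence labeling: same length in and out
    for tok, tag in zip(tokens, tags):
        print(f"{tok}\t{tag}")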

Key Characteristics

  • One label per position – Output length equals input length; each token (or frame) gets exactly one label. This distinguishes sequence labeling from segmentation, where boundaries are predicted, though encoding schemes such as BIO merge the two (each label encodes both segment type and position).
  • Label dependencies – Labels are not independent: valid tag sequences obey constraints (e.g. in BIO, I-X must follow B-X or I-X of the same type; grammatical tag sequences). Models use Markov assumptions (e.g. bigram), linear-chain CRFs, or recurrent/attention networks to capture these dependencies.
  • Decoding – Finding the best label sequence under the model is done with Viterbi (for linear-chain factorized scores) or beam search (common for neural models); exact decoding is tractable for chain-structured factors (see the sketch after this list).
  • Training signal – Supervision is usually full label sequences; training maximizes joint likelihood (generative HMM) or conditional likelihood (discriminative CRF, neural models), or minimizes a margin-based/structured hinge loss.
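
A minimal NumPy sketch of Viterbi decoding over chain-factorized scores; array names and shapes are illustrative, and scores are assumed to be in log space so they can be added:

    import numpy as np

    def viterbi(emission, transition, start):
        """Exact decoding for a linear chain.

        emission:   (T, K) per-position score of each of K labels (log-space)
        transition: (K, K) score of moving from label i to label j (log-space)
        start:      (K,)   score of starting in each label (log-space)
        Returns the highest-scoring label sequence as a list of label indices.
        """
        T, K = emission.shape
        score = start + emission[0]            # best score ending in each label at t=0
        backptr = np.zeros((T, K), dtype=int)  # argmax predecessors for backtracking

        for t in range(1, T):
            # candidate[i, j] = best path ending in i at t-1, then i -> j, then emit at t
            candidate = score[:, None] + transition + emission[t][None, :]
            backptr[t] = candidate.argmax(axis=0)
            score = candidate.max(axis=0)

        # Follow back-pointers from the best final label.
        best = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            best.append(int(backptr[t, best[-1]]))
        return best[::-1]

With K labels and T positions the dynamic program runs in O(T·K²) time, which is why exact decoding remains tractable for linear chains.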

Common Applications

  • Part-of-speech (POS) tagging – Assigning a grammatical category (noun, verb, etc.) to each word in a sentence
  • Named entity recognition (NER) – Identifying and typing spans such as person, organization, location (often with BIO/BIOES encoding)
  • Chunking (shallow parsing) – Labeling segments such as noun phrases and verb phrases
  • Speech recognition (frame-level) – Labeling each frame or segment with a phone or state
  • Biological sequence annotation – Labeling residues or bases with function or structure
  • Gesture and activity recognition – Labeling each time step with a gesture or activity class

Sequence Labeling Algorithms

Sequence labeling algorithms differ in how they model label dependencies (generative vs discriminative, locally vs globally normalized), whether they use handcrafted features or learned representations, and how they decode (exact Viterbi vs beam search). The choice depends on data size, the need for interpretability, and whether rich contextual representations (e.g. from pre-trained transformers) are required.

  • Hidden Markov Model (HMM) – Generative model: transition probabilities between hidden states (labels) and emission probabilities from state to observation (token). Decoding with Viterbi; training with Baum–Welch (EM) or supervised MLE. Interpretable and tractable; limited by Markov assumption and local features.
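
A toy sketch of supervised MLE training for an HMM tagger (count and normalize; smoothing and unknown-word handling are omitted):

    from collections import Counter, defaultdict

    def train_hmm(tagged_sentences):
        """Supervised MLE for an HMM: count and normalize transitions and emissions.

        tagged_sentences: list of [(token, tag), ...] sequences (toy, unsmoothed).
        """
        transitions = defaultdict(Counter)  # transitions[prev_tag][tag]
        emissions = defaultdict(Counter)    # emissions[tag][token]
        starts = Counter()

        for sent in tagged_sentences:
            prev = None
            for token, tag in sent:
                emissions[tag][token] += 1
                if prev is None:
                    starts[tag] += 1
                else:
                    transitions[prev][tag] += 1
                prev = tag

        def normalize(counter):
            total = sum(counter.values())
            return {k: v / total for k, v in counter.items()}

        return (normalize(starts),
                {t: normalize(c) for t, c in transitions.items()},
                {t: normalize(c) for t, c in emissions.items()})

Decoding would then run Viterbi over the log transition and emission probabilities, as in the earlier sketch.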

  • Maximum Entropy Markov Model (MEMM) – Discriminative: per-position classifier (e.g. MaxEnt) conditioned on previous label; trained to maximize conditional likelihood of the label sequence. Decoding with Viterbi; avoids modeling observation distribution but susceptible to label bias.
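
A toy MEMM-style sketch, assuming scikit-learn is available; the feature template, the "<s>" start symbol, and greedy decoding (rather than full Viterbi over the per-position conditionals) are simplifications for illustration:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(tokens, i, prev_tag):
        # Illustrative feature template: current word, suffix, and the previous tag.
        return {"word": tokens[i].lower(),
                "suffix3": tokens[i][-3:].lower(),
                "prev_tag": prev_tag}

    def train_memm(tagged_sentences):
        X, y = [], []
        for sent in tagged_sentences:
            tokens = [tok for tok, _ in sent]
            prev = "<s>"
            for i, (_, tag) in enumerate(sent):
                X.append(features(tokens, i, prev))
                y.append(tag)
                prev = tag
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
        return vec, clf

    def greedy_decode(vec, clf, tokens):
        # Greedy left-to-right tagging; a full MEMM would run Viterbi instead.
        tags, prev = [], "<s>"
        for i in range(len(tokens)):
            tag = clf.predict(vec.transform([features(tokens, i, prev)]))[0]
            tags.append(tag)
            prev = tag
        return tags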

  • Conditional Random Field (CRF) – Discriminative: globally normalized distribution over label sequences; factors over edges (and optionally nodes) in a linear chain. Trained with conditional MLE; decoding with Viterbi. Handles feature-rich inputs and avoids label bias; widely used as a layer on top of feature or representation extractors.
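
A minimal sketch using the sklearn-crfsuite package (assumed installed); the feature template and the tiny training set are illustrative only:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def word_features(tokens, i):
        # Feature-rich template: the CRF can condition freely on the whole observation.
        feats = {"word.lower": tokens[i].lower(),
                 "word.istitle": tokens[i].istitle(),
                 "suffix3": tokens[i][-3:].lower(),
                 "BOS": i == 0, "EOS": i == len(tokens) - 1}
        if i > 0:
            feats["prev_word.lower"] = tokens[i - 1].lower()
        return feats

    # X: one list of per-token feature dicts per sentence; y: the matching tag sequences.
    train_sents = [["John", "lives", "in", "Berlin"]]
    train_tags = [["B-PER", "O", "O", "B-LOC"]]
    X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
    y = train_tags

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, y)
    print(crf.predict(X))  # Viterbi-decoded tag sequences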

  • BiLSTM-CRF – Bidirectional LSTM encodes the input sequence; CRF layer on top models label dependencies and is decoded with Viterbi. Combines neural representation with globally normalized label structure; standard baseline for NER and POS.
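
A condensed PyTorch sketch, assuming the pytorch-crf package provides the CRF layer; dimensions are arbitrary and the training loop is omitted:

    import torch.nn as nn
    from torchcrf import CRF  # pip install pytorch-crf

    class BiLSTMCRF(nn.Module):
        """BiLSTM encoder producing per-token emission scores, CRF layer on top."""

        def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                                bidirectional=True)
            self.emit = nn.Linear(hidden_dim, num_tags)  # per-token emission scores
            self.crf = CRF(num_tags, batch_first=True)   # transition scores + Viterbi

        def loss(self, token_ids, tags, mask):
            emissions = self.emit(self.lstm(self.embed(token_ids))[0])
            return -self.crf(emissions, tags, mask=mask, reduction="mean")

        def predict(self, token_ids, mask):
            emissions = self.emit(self.lstm(self.embed(token_ids))[0])
            return self.crf.decode(emissions, mask=mask)  # Viterbi decoding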

  • BiLSTM (no CRF) – Bidirectional LSTM with a softmax (or other classifier) per position; each position is predicted independently given its hidden state. No explicit modeling of label dependencies; simpler and faster than adding a CRF layer, but often slightly worse on strict span-level metrics.
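
A corresponding per-position sketch (hypothetical dimensions): the same encoder, but with an independent classifier at each step trained with token-level cross-entropy:

    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        """BiLSTM encoder with an independent softmax classifier per position."""

        def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                                bidirectional=True)
            self.out = nn.Linear(hidden_dim, num_tags)
            # Per-token cross-entropy; padded positions are labeled -100 and ignored.
            self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

        def forward(self, token_ids):
            return self.out(self.lstm(self.embed(token_ids))[0])  # (B, T, num_tags)

        def loss(self, token_ids, tags):
            logits = self.forward(token_ids)
            return self.loss_fn(logits.flatten(0, 1), tags.flatten())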

  • Transformer / BERT for sequence labeling – Pre-trained transformer (e.g. BERT) encodes the sequence; a linear or small MLP head on top of each token representation predicts the label. Fine-tuned with cross-entropy (per token). Captures long-range context; state-of-the-art for NER and POS when data and compute allow.
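
A minimal inference-side sketch with the Hugging Face transformers library (assumed installed); the label set is hypothetical, and the freshly initialized classification head would still need fine-tuning before its predictions mean anything:

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical tag set

    name = "bert-base-cased"  # any pre-trained encoder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(labels))

    tokens = ["Barack", "Obama", "visited", "Paris"]
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

    with torch.no_grad():
        logits = model(**enc).logits          # (1, num_subwords, num_labels)
    pred = logits.argmax(dim=-1)[0].tolist()  # one label id per subword token

    # Map subword predictions back to words (label of each word's first subword).
    word_ids = enc.word_ids()
    word_preds = {}
    for idx, wid in enumerate(word_ids):
        if wid is not None and wid not in word_preds:
            word_preds[wid] = labels[pred[idx]]
    print([word_preds[i] for i in range(len(tokens))])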